Segmenting DNA sequence into 'words' based on statistical language model

Author

Wang Liang

Abstract

[Abstract] This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA “words” are 12 to 15 bp long. The bound on the language entropy of DNA sequences is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised “probability approach to word segmentation” method to segment DNA sequences. A benchmark for the segmentation method is also proposed. In a cross-segmentation test, we find that different genomes may use similar languages that belong to different branches, much like English and French/Latin. Finally, we present some possible applications of this method.
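To make the idea of probability-based word segmentation concrete, the following is a minimal sketch, not the paper's implementation. It assumes a unigram word model (the paper builds an n-gram model) and finds the most probable split of a sequence by dynamic programming. The names `segment`, `TOY_VOCAB`, `max_len`, and `unk_prob` are hypothetical, and the vocabulary probabilities are invented toy values rather than results from the paper.

```python
import math

# Hypothetical toy vocabulary: word -> probability (invented values).
TOY_VOCAB = {"ATG": 0.4, "GCC": 0.3, "TAA": 0.2, "AT": 0.05, "G": 0.05}


def segment(seq, word_probs, max_len=15, unk_prob=1e-8):
    """Return the highest-probability split of seq into words.

    Uses a unigram model: the score of a split is the sum of the log
    probabilities of its words. Single characters missing from the
    vocabulary receive the small probability unk_prob (an assumption
    of this sketch) so that every sequence has at least one split.
    """
    n = len(seq)
    best = [float("-inf")] * (n + 1)  # best[i]: best log score of seq[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in seq[:i]
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = seq[start:end]
            prob = word_probs.get(word)
            if prob is None:
                if end - start != 1:
                    continue          # unknown multi-character strings are not words
                prob = unk_prob       # fallback for an unknown single character
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Walk the back-pointers from the end of the sequence to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(seq[start:end])
        end = start
    return words[::-1]


print(segment("ATGGCCTAA", TOY_VOCAB))
```

The `max_len=15` default mirrors the upper end of the 12 to 15 bp word length reported in the abstract; a real model would estimate the vocabulary and its probabilities from genome data instead of using a hand-written table.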

Similar resources

Segmenting DNA sequence into `words'

[Abstract] This paper presents a novel method to segment/decode DNA sequences based on a statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA “words” are 12 to 15 bp long. Then we apply an unsupervised approach to build the DNA vocabulary and design a DNA sequence segmentation method. We also find different genomes are likely to use the similar...

A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...

Statistical Language Processing Using Hidden Understanding Models

This paper introduces a class of statistical mechanisms, called hidden understanding models, for natural language processing. Much of the framework for hidden understanding models derives from statistical models used in speech recognition, especially the use of hidden Markov models. These techniques are applied to the central problem of determining meaning directly from a sequence of spoken or ...

Weblog Classification for Fast Splog Filtering: A URL Language Model Segmentation Approach

This paper shows that in the context of statistical weblog classification for splog filtering based on n-grams of tokens in the URL, further segmenting the URLs beyond the standard punctuation is helpful. Many splog URLs contain phrases in which the words are glued together in order to avoid splog filtering techniques based on punctuation segmentation and unigrams. A technique which segments lo...

A Stochastic Japanese Morphological Analyzer Using a Forward-DP Backward-A* N-Best Search Algorithm

We present a novel method for segmenting the input sentence into words and assigning parts of speech to the words. It consists of a statistical language model and an efficient two-pass N-best search algorithm. The algorithm does not require delimiters between words. Thus it is suitable for written Japanese. The proposed Japanese morphological analyzer achieved 95.1% recall and 94.6% precisio...

Journal:

CoRR

Volume abs/1202.2518

Publication date: 2012
